Authorship Attribution with Latent Dirichlet Allocation
نویسندگان
چکیده
The problem of authorship attribution – attributing texts to their original authors – has been an active research area since the end of the 19th century, attracting increased interest in the last decade. Most of the work on authorship attribution focuses on scenarios with only a few candidate authors, but recently considered cases with tens to thousands of candidate authors were found to be much more challenging. In this paper, we propose ways of employing Latent Dirichlet Allocation in authorship attribution. We show that our approach yields state-of-the-art performance for both a few and many candidate authors, in cases where these authors wrote enough texts to be modelled effectively.
منابع مشابه
Authorship Attribution with Author-aware Topic Models
Authorship attribution deals with identifying the authors of anonymous texts. Building on our earlier finding that the Latent Dirichlet Allocation (LDA) topic model can be used to improve authorship attribution accuracy, we show that employing a previously-suggested Author-Topic (AT) model outperforms LDA when applied to scenarios with many authors. In addition, we define a model that combines ...
متن کاملStopwords and Stylometry: A Latent Dirichlet Allocation Approach
We illustrate the utility of generative models for the purpose of stylometry – the science of author attribution. Though content words provide semantic handles and intuitively relate to author-styles, they are usually associated with a large vocabulary and are not consistent across corpora. On the contrary, stopwords are limited in number and do not suffer from the above mentioned issues and ye...
متن کاملAuthor's personal copy Authorship attribution based on a probabilistic topic model
This paper describes, evaluates and compares the use of Latent Dirichlet allocation (LDA) as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topic distributions with each topic specifying a distribution over words. Based on author profiles (aggregation of all texts written by the same writer) we suggest comput...
متن کاملAuthorship attribution based on a probabilistic topic model
This paper describes, evaluates and compares the use of Latent Dirichlet allocation (LDA) as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topic distributions with each topic specifying a distribution over words. Based on author profiles (aggregation of all texts written by the same writer) we suggest comput...
متن کاملLearning Stylometric Representations for Authorship Analysis
Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of exponentially exploding textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, mo...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011